Outlier Analysis in R

2024-06-25 21:50 | Source: compiled from the web

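Outlier Analysis in R: Detecting and Removing Outliers

Author: Safa Mulani

Hello, readers! This article takes a detailed look at outlier analysis in R. Let's get started!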

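What Are Outliers in Data?

Before looking closely at outliers, let's first review data preprocessing.

In data science and machine learning, data preprocessing is a crucial step. It lets us remove errors and noise from the data before modeling.

With that in mind, let's focus on detecting and removing outliers in R.

An outlier, as the name suggests, is a data point that lies far from the other points in the dataset, and so distorts the dataset's overall distribution.

Such values are usually regarded as abnormally distributed data values.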

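Effects of outliers on a model:

- They skew the data.
- They change the overall statistical distribution of the data in terms of mean, variance, and so on.
- They bias the accuracy of the model.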
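To make the effect on the mean and variance concrete, here is a minimal sketch on a small made-up vector (the numbers are illustrative only and do not come from the bike rental dataset). It compares summary statistics with and without a single extreme value.

```r
# Toy data (made up for illustration): six typical readings plus one extreme value
clean <- c(10, 12, 11, 13, 12, 11)
with_outlier <- c(clean, 100)

# Mean and variance without and with the extreme value
mean_clean <- mean(clean)
mean_out   <- mean(with_outlier)
var_clean  <- var(clean)
var_out    <- var(with_outlier)

# The median, by contrast, is much less sensitive to one extreme value
median_clean <- median(clean)
median_out   <- median(with_outlier)

print(c(mean_clean = mean_clean, mean_out = mean_out))
print(c(var_clean = var_clean, var_out = var_out))
print(c(median_clean = median_clean, median_out = median_out))
```

A single extreme point pulls the mean upward and inflates the variance by orders of magnitude, while the median barely moves.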

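Now that we understand the effects of outliers, it is time to deal with them.

Outlier Analysis: Getting Started

The first step is to detect whether outliers are present in the dataset.

The examples below use the Bike Rental Count Prediction dataset, which the original post links to.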

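Loading the Dataset

First, load the dataset into the R environment with the read.csv() function.

Before detecting outliers, check for missing values with sum(is.na(data)), so that no null or missing entries go unnoticed.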

# Remove all existing objects from the workspace
rm(list = ls())

# Set the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()

# Load the dataset
bike_data = read.csv("day.csv", header = TRUE)

### Missing Value Analysis ###
sum(is.na(bike_data))
summary(is.na(bike_data))
# From the above result, it is clear that the dataset contains no missing values.

As shown, the data contains no missing values.
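For readers without the CSV at hand, the same check can be tried on a toy data frame. This is a minimal sketch; the data frame `toy` and its values are made up for illustration. It shows the difference between the total count from sum(is.na()) and the per-column count from colSums(is.na()).

```r
# Toy data frame (made up for illustration) with two missing entries
toy <- data.frame(
  temp = c(0.34, NA, 0.20, 0.23),
  hum  = c(0.81, 0.70, NA, 0.59)
)

# Total number of NA cells in the whole data frame
total_missing <- sum(is.na(toy))

# Number of NA cells in each column
missing_by_column <- colSums(is.na(toy))

print(total_missing)
print(missing_by_column)
```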

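Detecting Outliers with the boxplot() Function

It is now time to check whether the dataset contains outliers. To do so, use c() to store the names of the numeric columns in a separate variable.

Next, use the boxplot() function to check the numeric variables for outliers.

[Figure: box plot of the numeric variables]

The plot clearly shows that the 'hum' and 'windspeed' variables contain outliers.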
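The points a box plot draws as outliers can also be listed numerically. The sketch below assumes the default settings of boxplot.stats() (coef = 1.5), under which a value is flagged when it lies more than 1.5 times the interquartile range beyond the hinges returned by fivenum(). The vector `x` is made up for illustration.

```r
# Toy vector (made up for illustration): values near 0.6 plus two extremes
x <- c(0.55, 0.58, 0.60, 0.61, 0.62, 0.63, 0.65, 0.66, 0.68, 0.05, 1.40)

# Five-number summary: min, lower hinge, median, upper hinge, max
h <- fivenum(x)
iqr <- h[4] - h[2]

# Fences at 1.5 * IQR beyond the hinges (the default coef of boxplot.stats)
lower_fence <- h[2] - 1.5 * iqr
upper_fence <- h[4] + 1.5 * iqr

# Values outside the fences, computed by hand
manual_out <- x[x < lower_fence | x > upper_fence]

# The same values as reported by boxplot.stats()
flagged <- boxplot.stats(x)$out

print(manual_out)
print(flagged)
```

Both approaches should report the same two values, which is why boxplot.stats()$out is a convenient way to pull the outliers out of a column programmatically.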

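Replacing Outliers with NA

Once the outliers have been detected, the values identified by boxplot() can be replaced with NA, R's missing-value marker, so that they can be handled in a later step, as shown below.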

############################## Outlier Analysis -- DETECTION ##############################

# 1. Outliers occur only in continuous/numeric variables, so the names of the
#    numeric and the categorical independent variables are stored in separate vectors.
col = c('temp','cnt','hum','windspeed')
categorical_col = c("season","yr","mnth","holiday","weekday","workingday","weathersit")

# 2. Use a box plot to detect outliers in the numeric/continuous columns.
boxplot(bike_data[,c('temp','atemp','hum','windspeed')])
# From the above visualization, it is clear that the variables 'hum' and 'windspeed' contain outliers.

# OUTLIER ANALYSIS -- Removal of Outliers
# 1. The box plot shows that outliers are present. boxplot.stats() flags values lying
#    more than 1.5 times the interquartile range above the upper quartile or below
#    the lower quartile.
# 2. Replace the outlier values with NA.
for (x in c('hum','windspeed')) {
  value = bike_data[,x][bike_data[,x] %in% boxplot.stats(bike_data[,x])$out]
  bike_data[,x][bike_data[,x] %in% value] = NA
}

# Check whether the outliers in the columns above were replaced with NA.
sum(is.na(bike_data$hum))
sum(is.na(bike_data$windspeed))
as.data.frame(colSums(is.na(bike_data)))

Verifying That All Outliers Were Replaced with NA

Now use sum(is.na()) to check for missing data, that is, to confirm that the outliers were successfully replaced with missing values.

Output:

> sum(is.na(bike_data$hum))
[1] 2
> sum(is.na(bike_data$windspeed))
[1] 13
> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
instant                            0
dteday                             0
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                2
windspeed                         13
casual                             0
registered                         0
cnt                                0

As shown, the 2 outliers in the 'hum' column and the 13 outliers in the 'windspeed' column have been converted to missing (NA) values.
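The replace-and-verify step can be reproduced without the CSV. The following is a minimal sketch on a toy data frame; the values and the helper name `flag_outliers_as_na` are hypothetical and introduced only for illustration.

```r
# Toy data frame (made up for illustration); 0.05 and 0.95 sit far from the rest
toy <- data.frame(
  hum       = c(0.60, 0.62, 0.58, 0.61, 0.63, 0.59, 0.05, 0.64),
  windspeed = c(0.18, 0.20, 0.19, 0.21, 0.17, 0.95, 0.22, 0.20)
)

# Hypothetical helper: set every value flagged by boxplot.stats() to NA
flag_outliers_as_na <- function(v) {
  v[v %in% boxplot.stats(v)$out] <- NA
  v
}

# Apply the helper to each numeric column of interest
for (col_name in c("hum", "windspeed")) {
  toy[[col_name]] <- flag_outliers_as_na(toy[[col_name]])
}

# Count the NA values per column to confirm the replacement
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Wrapping the replacement in a small function keeps the loop body short and makes it easy to reuse on other columns.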

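Dropping Rows with Missing Values

As the final step, these missing values need to be handled. The drop_na() function from the 'tidyr' library removes every row that contains an NA value.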
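To see what dropping rows does, here is a minimal sketch on a toy data frame (made up for illustration). It uses base R's complete.cases() rather than tidyr, an assumption made only so the example runs without extra packages. For a plain data frame like this one, drop_na() is expected to keep the same rows.

```r
# Toy data frame (made up for illustration) with NA in two different rows
toy <- data.frame(
  hum       = c(0.60, NA, 0.58, 0.61),
  windspeed = c(0.18, 0.20, NA, 0.21)
)

# Keep only the rows that have no missing value in any column
toy_complete <- toy[complete.cases(toy), ]

print(dim(toy))
print(dim(toy_complete))
print(colSums(is.na(toy_complete)))
```

Note that a whole row is discarded even when only one of its columns is NA, so the other values in that row are lost as well.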

# Remove the rows that contain NA values
library(tidyr)
bike_data = drop_na(bike_data)
as.data.frame(colSums(is.na(bike_data)))

Output:

> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
instant                            0
dteday                             0
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                0
windspeed                          0
casual                             0
registered                         0
cnt                                0

As shown, all the outliers have now been removed!

Conclusion

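This tutorial covered how to detect and remove outliers in R, which brings us to the end of the topic. For more tutorials on R programming, stay tuned to this site! :)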
